Prithviraj Kadiyala

1 My Video

2 Introduction

Salary - A fixed regular payment, typically paid on a monthly or biweekly basis but often expressed as an annual sum, made by an employer to an employee, especially a professional or white-collar worker.

The following was taken from a Forbes article.

There has been a lot of buzz in the software industry about unequal pay between employees who have been working for a single company for a long time and recent graduates who start on very good salaries. A lot of articles have also been written about loyal employees being paid less, while employees who change jobs every 3-4 years get hikes of almost 50% in salary.

Those very new to the tech industry, with less than a year of experience, can expect to earn $50,321 (a year-over-year increase of 9.8 percent). After a year or two, that average salary jumps to $62,517 (a whopping 24.3 percent increase, year-over-year).

Spend three to five years, and the average leaps yet again, to $68,040 (a 6.3 percent increase). Between six and ten years in the industry, salaries hit $83,143 (a rise of 6.8 percent).

Breaking the ten-year mark translates into big bucks. Those with 11 to 15 years of experience could expect to pull down $96,792 (a 3.8 percent increase over last year), while those with more than 15 years average $115,399 (a 6 percent increase).

Below is the graph that shows us the salary hike when employees jump companies:

The data was collected here:

2.1 What are the variables?

dataset = read.csv("Emp_Salary.csv",header=TRUE,sep=",")
head(dataset)
names(dataset)
## [1] "Employee" "EducLev"  "JobGrade" "YrsExper" "Age"      "Gender"  
## [7] "YrsPrior" "PCJob"    "Salary"

2.1.1 Plot data

library(s20x)
## Warning: package 's20x' was built under R version 3.4.4
pairs20x(dataset)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
g = ggplot(dataset, aes(x = YrsExper, y = Salary, color = EducLev)) + geom_point()
g = g + xlab("Years of Experience") 
g = g + geom_smooth(method = "loess")
g

2.2 How were the data collected?

2.3 What is the story behind the data?

2.4 Why was it gathered?

2.5 What is your interest in the data?

With the prospect of working in the software industry in the future, it would be really cool to analyze how the IT industry works beforehand. Being prepared with what to do and when to do it, given the circumstances, puts me in a much better position when entering the market and negotiating for a higher base salary package.

2.6 What problem do you wish to solve?

I would like to look into the dataset and determine whether salary depends directly on years of experience. If not, do we see a lot of variation in salary as the years of experience increase? Some people claim that technical knowledge alone will take us to a position that pays well. That is true to some extent, but here we will study how having experience affects one’s salary. So the problem we define today is: “Is years of experience alone enough to increase your salary, or does having years and years of experience have no effect on your salary at all?”

3 Theory needed to carry out SLR

I believe that the value of Salary (Y) tends to increase with Years of Experience (X); that is, as an employee gains years of experience, his salary also tends to increase. I want to build a model relating the two variables to one another by drawing a line through all the points. I will define salary as my dependent variable, and years of experience as my independent variable.

I could use a deterministic model if all the data points were perfectly aligned and I didn’t have to worry about errors in my prediction; however, I know from the preliminary graphs that the data points are not perfectly aligned. A probabilistic model will be more accurate, in this instance, as it will take into account the randomness of the distribution of data points around the line. A simple linear regression model (hereafter referred to as SLR) is one type of probabilistic model, and will be used in my data analysis. SLR assumes that the mean value of the y data for any value of the x data will make a straight line when graphed, and that any deviation of a point from the line (above or below) is equal to the random error \(\epsilon_i\). This statement is written as:

\[ y_i= \beta_0 +\beta_1x_i+\epsilon_i \]

Here \(\beta_0\) and \(\beta_1\) are unknown parameters, \(\beta_0+\beta_1x_i\) is the mean value of y for a given x, and \(\epsilon_i\) is the random error. Working with the assumption that some points are going to deviate from the line, I know that some will be above (positive deviation) and some below (negative deviation), with \(E(\epsilon_i)=0\). That would make the mean value of y: \[ \begin{align} E(y)&=E(\beta_0+\beta_1x_i+\epsilon_i)\\ &=\beta_0+\beta_1x_i+E(\epsilon_i)\\ &=\beta_0+\beta_1x_i \end{align} \]

Thus, the mean value of y for any given value of x is represented by \(E(Y|x)\) and graphs as a straight line, with an intercept of \(\beta_0\) and a slope of \(\beta_1\).
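
A minimal simulation sketch (not part of the report’s analysis) makes this concrete: if we generate data from \(y=\beta_0+\beta_1x_i+\epsilon_i\) with \(E(\epsilon_i)=0\), the sample mean of y at each x should track the line \(\beta_0+\beta_1x\). The values of b0, b1 and the error sd below are made up for illustration.

```r
# Simulate the SLR model and check that the mean of y at each x follows the line.
set.seed(1)
b0 <- 30000; b1 <- 1000
x  <- rep(1:10, each = 500)                      # 500 replicates at each x value
y  <- b0 + b1 * x + rnorm(length(x), mean = 0, sd = 5000)
Ey <- tapply(y, x, mean)                         # sample mean of y at each x
round(Ey - (b0 + b1 * (1:10)))                   # deviations from the true line are small
```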

The regression has five key assumptions:

  1. Linear Relationship
  2. Multivariate Normality
  3. No or little multicollinearity
  4. No autocorrelation
  5. Homoscedasticity

4 Validity with mathematical expressions

In order to estimate \(\beta_0\) and \(\beta_1\) we are going to use the method of least squares. As discussed in class, this helps us determine the line that best fits our data points by minimizing the sum of squares of the deviations, called the SSE or Sum of Squares for Errors. In the straight-line model we have \(y_i= \beta_0 +\beta_1x_i+\epsilon_i\), and the estimator is \(\hat y= \hat\beta_0 +\hat \beta_1x_i\). The residual (the deviation of the ith value of y from its predicted value) is calculated by \((y_i-\hat y_i) = y_i-(\hat\beta_0+\hat\beta_1x_i)\). Thus \(SSE=\sum^n_{i=1}[y_i-(\hat\beta_0+\hat\beta_1x_i)]^2\)

If the model works well with our data then we should expect that the residuals are approximately normal in distribution with mean = 0 and a constant variance.
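
As a sketch of the least-squares machinery on simulated data (the report’s Emp_Salary.csv is not reproduced here), the closed-form estimates \(\hat\beta_1 = S_{xy}/S_{xx}\) and \(\hat\beta_0 = \bar y - \hat\beta_1\bar x\) minimize the SSE and should match what lm() returns:

```r
# Closed-form least-squares estimates, checked against lm() on made-up data.
set.seed(2)
x <- runif(50, 0, 30)
y <- 30000 + 1000 * x + rnorm(50, sd = 8000)
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # Sxy / Sxx
b0_hat <- mean(y) - b1_hat * mean(x)
fit <- lm(y ~ x)
c(b0_hat, b1_hat)      # agrees with coef(fit)
```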

The following function was taken from https://rpubs.com/therimalaya/43190

4.1 Checks on validity

4.2 Creating a linear model

dataset.lm=lm(Salary~YrsExper,data=dataset)
summary(dataset.lm)
## 
## Call:
## lm(formula = Salary ~ YrsExper, data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33054  -5782   -967   5792  30971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30329.73    1054.54   28.76   <2e-16 ***
## YrsExper      991.64      88.44   11.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8892 on 206 degrees of freedom
## Multiple R-squared:  0.379,  Adjusted R-squared:  0.376 
## F-statistic: 125.7 on 1 and 206 DF,  p-value: < 2.2e-16

We can get the values for the Intercept(\(\beta_0\)) and the estimate(\(\beta_1\)) from the summary above. \[ \begin{align} \beta_0 &= 30329.73\\ \beta_1 &= 991.64 \end{align} \]

4.3 Calculating the Confidence Interval for Parameter Estimates

ciReg(dataset.lm, conf.level=0.95, print.out=TRUE)
##             95 % C.I.lower    95 % C.I.upper
## (Intercept)     28250.6592          32408.80
## YrsExper          817.2669           1166.01
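
ciReg() comes from the s20x package; base R’s confint() produces the same style of interval. A sketch on simulated data (not the report’s dataset.lm):

```r
# 95% confidence intervals for the coefficients via base R's confint().
set.seed(7)
x <- runif(80, 0, 30)
y <- 30000 + 1000 * x + rnorm(80, sd = 8000)
fit <- lm(y ~ x)
confint(fit, level = 0.95)    # one row per coefficient, lower bound < upper bound
```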

4.3.1 Least squares estimates

\[ \begin{align} \hat\beta_0 + \hat\beta_1x_i &= 30329.73 + 991.64* x_i \end{align} \]

The least-squares estimate of the slope, \(\hat\beta_1=991.64\), indicates that the estimated salary increases by about $992 for every additional year of experience, with this interpretation being valid over the observed range of years-of-experience values. We can see that the increase in salary is dependent on Years of Experience. But let’s also fit a quadratic model, to see whether a curve fits the data better and makes the result we just got any clearer.
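
To illustrate this slope interpretation on simulated data (not the report’s fitted model): predictions one year apart differ by exactly the estimated slope.

```r
# The difference between predictions one x-unit apart equals the fitted slope.
set.seed(3)
x <- 1:40
y <- 30000 + 992 * x + rnorm(40, sd = 5000)
fit <- lm(y ~ x)
p <- predict(fit, data.frame(x = c(10, 11)))
unname(diff(p))        # equals the slope coef(fit)["x"]
```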

5 Verifying Assumptions

plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
              ylim=c(0,1.1*max(dataset$Salary)),xlim=c(0,1.1*max(dataset$YrsExper)),
              main="Residual Line Segments of Salary vs YrsExper", data=dataset)
abline(dataset.lm)

5.1 Plot of Residuals

plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
              ylim=c(0,1.1*max(dataset$Salary)),xlim=c(0,1.1*max(dataset$YrsExper)),
              main="Residual Line Segments of Salary vs YrsExper", data=dataset)
ht.lm=with(dataset, lm(Salary~YrsExper))
abline(ht.lm)
yhat=with(dataset,predict(ht.lm,data.frame(YrsExper)))
with(dataset,{segments(YrsExper,Salary,YrsExper,yhat)})
abline(ht.lm)

5.2 Plot of Mean

plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
             ylim=c(0,1.1*max(dataset$Salary)),xlim=c(0,1.1*max(dataset$YrsExper)),
             main="Mean of Salary vs YrsExper", data=dataset)
abline(dataset.lm)
with(dataset, abline(h=mean(Salary)))
with(dataset, segments(YrsExper,mean(Salary),YrsExper,yhat,col="Red"))

5.3 Plot of means with total deviations from the Line Segments

plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
              ylim=c(0,1.1*max(dataset$Salary)),xlim=c(0,1.1*max(dataset$YrsExper)),
              main="Total Deviation Line Segments of Salary vs YrsExper", data=dataset)
with(dataset,abline(h=mean(Salary)))
with(dataset, segments(YrsExper,Salary,YrsExper,mean(Salary),col="Green"))

5.4 Using MSS, RSS and TSS

RSS=with(dataset,sum((Salary-yhat)^2))
RSS
## [1] 16287668202
MSS=with(dataset,sum((yhat-mean(Salary))^2))
MSS
## [1] 9939439028
TSS=with(dataset,sum((Salary-mean(Salary))^2))
TSS
## [1] 26227107231

\(R^2\) is equal to \(\frac{MSS}{TSS}\), the proportion of the total variation in Salary that is explained by the trend line. The closer \(R^2\) is to 1, the better the fit of the trend line.

MSS/TSS
## [1] 0.3789758

This value indicates that the trend line is not a very good fit for the data I have.
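
The decomposition above can be checked directly on simulated data (again, not the report’s dataset): for a least-squares fit with an intercept, TSS = MSS + RSS, and MSS/TSS equals the \(R^2\) reported by summary().

```r
# Verify TSS = MSS + RSS and R^2 = MSS/TSS on made-up data.
set.seed(4)
x <- runif(100, 0, 30)
y <- 30000 + 1000 * x + rnorm(100, sd = 9000)
fit  <- lm(y ~ x)
yhat <- fitted(fit)
RSS <- sum((y - yhat)^2)          # residual sum of squares
MSS <- sum((yhat - mean(y))^2)    # model sum of squares
TSS <- sum((y - mean(y))^2)       # total sum of squares
c(MSS / TSS, summary(fit)$r.squared)   # the two values agree
```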

5.5 Trendscatter

trendscatter(Salary~YrsExper,f=0.5,data=dataset)

Here, we use trendscatter to get a feel for how the data are scattered. Then, based on the scatter, we will do specific analysis in order to get the best results for our research interest.

We can see that the data are concentrated near the line, with fewer and fewer data points as we move away from the trend line. From this we can say that the error looks roughly normally distributed. But again, this is only visual inspection, and we will have to perform further analysis to get more accurate results. The first thing to do is fit a linear model, see how it looks, and go forward from there.

5.6 Find the residuals and Fitted Values

Yrs.res=residuals(dataset.lm)
Yrs.fit=fitted(dataset.lm)

5.7 Residuals vs Yrs of Experience values

plot(dataset$YrsExper,Yrs.res, xlab="YrsExper",ylab="Residuals",ylim=c(-1.5*max(Yrs.res),1.5*max(Yrs.res)),xlim=c(0,1.6*max(Yrs.fit)), main="Residuals vs Yrs of Experience")

It looks as though the residuals are scattered about zero on the y-axis, with slightly more values falling just below zero; overall this indicates that there is no SIGNIFICANT deviation from the line of best fit.

5.8 Trendscatter on Residual Vs Fitted

trendscatter(Yrs.fit,Yrs.res, xlab="Fitted", ylab="Residuals") 

5.9 Shapiro-wilk

normcheck(dataset.lm,shapiro.wilk = TRUE)

The p-value for the Shapiro-Wilk test is 0. The null hypothesis in this case is that the errors are normally distributed.

\[\epsilon_i \sim N(0,\sigma^2)\]

The results of the Shapiro-Wilk test indicate that we have enough evidence to reject the null hypothesis (the p-value of 0 is below the standard cutoff of 0.05), leading us to the conclusion that the errors are not normally distributed.
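
normcheck() is an s20x helper; base R’s shapiro.test() applied to the residuals gives the same Shapiro-Wilk p-value. A sketch on simulated data whose errors are normal by construction:

```r
# Shapiro-Wilk test on the residuals of a simulated fit.
set.seed(5)
x <- runif(60, 0, 30)
y <- 30000 + 1000 * x + rnorm(60, sd = 8000)   # normal errors by construction
fit <- lm(y ~ x)
shapiro.test(residuals(fit))$p.value           # a p-value in (0, 1]
```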

6 Testing another model for comparison

\[ y=\hat\beta_0+\hat\beta_1x_i+\hat\beta_2x^2_i \]

quad.lm=lm(Salary~YrsExper + I(YrsExper^2),data=dataset)

plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
   ylim=c(0,1.1*max(dataset$Salary)),xlim=c(0,1.1*max(dataset$YrsExper)),main="Scatter Plot and Quadratic of Salary vs YrsExper",data=dataset)
myplot = function(x){quad.lm$coef[1] + quad.lm$coef[2]*x + quad.lm$coef[3]*x^2}
curve(myplot, lwd = 2, add = TRUE)

Fitting a quadratic to the data produces a visibly different result; the fit does not look completely linear, as we saw earlier with the linear model plot, but further analysis will clarify the results and allow us to make a final decision.

quad.fit = fitted(quad.lm)  # fitted values from the quadratic model

6.1 A Plot of the Residuals versus Fitted Values

plot(quad.lm, which = 1)

normcheck(quad.lm, shapiro.wilk = TRUE)

The p-value is 0. Again, the results of the Shapiro-Wilk test indicate that we DO have enough evidence to reject the null hypothesis, leading us to again assume that the data is NOT distributed normally.

6.2 Summarize the model

summary(quad.lm)
## 
## Call:
## lm(formula = Salary ~ YrsExper + I(YrsExper^2), data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41717  -6013  -1285   5617  22271 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36378.79    1565.92  23.232  < 2e-16 ***
## YrsExper       -199.45     251.95  -0.792    0.429    
## I(YrsExper^2)    38.49       7.68   5.012 1.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8413 on 205 degrees of freedom
## Multiple R-squared:  0.4468, Adjusted R-squared:  0.4414 
## F-statistic: 82.77 on 2 and 205 DF,  p-value: < 2.2e-16

Here we have the following values \[ \begin{align} \beta_0 &= 36378.79\\ \beta_1 &= -199.45\\ \beta_2 &= 38.49 \end{align} \]

We also have the R-squared values in the summary, which we can use to compare against the earlier linear model and see which of the models fits the data better.

6.3 Calculating the Confidence Interval

ciReg(quad.lm, conf.level=0.95, print.out=TRUE)
##               95 % C.I.lower    95 % C.I.upper
## (Intercept)      33291.41402       39466.17626
## YrsExper          -696.19349         297.29158
## I(YrsExper^2)       23.35142          53.63647

So the equation comes out to be \[ \begin{align} \hat\beta_0 + \hat\beta_1x_i+\hat\beta_2x^2_i&= 36378.79 -199.45x_i + 38.49x^2_i \end{align} \]

7 Model selection

7.1 Making Predictions using the Two Models

7.1.1 For Model1

amount = predict(dataset.lm, data.frame(YrsExper=c(10,25,40)))
amount
##        1        2        3 
## 40246.11 55120.69 69995.26

7.1.2 For Model2

amount2 = predict(quad.lm, data.frame(YrsExper=c(10,25,40)))
amount2
##        1        2        3 
## 38233.68 55451.24 89991.07

The predictions made using the first model (linear) are greater at 10 years and smaller at 25 and 40 years than those made by the second model (quadratic); they are close at 10 and 25 years but diverge sharply by 40 years. Further comparisons will be necessary to determine which model is the best fit for the dataset.
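
Beyond comparing individual predictions, a nested-model F test compares the two fits directly; a small p-value says the quadratic term genuinely improves the model. This sketch uses simulated data with real curvature, not the report’s dataset:

```r
# Compare nested linear and quadratic models with an F test via anova().
set.seed(8)
x <- runif(150, 0, 30)
y <- 36000 - 200 * x + 40 * x^2 + rnorm(150, sd = 8000)
lin  <- lm(y ~ x)
quad <- lm(y ~ x + I(x^2))
anova(lin, quad)     # row 2 holds the F statistic and p-value for the added term
```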

7.2 Use adjusted \(R^2\)

summary(dataset.lm)
## 
## Call:
## lm(formula = Salary ~ YrsExper, data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33054  -5782   -967   5792  30971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30329.73    1054.54   28.76   <2e-16 ***
## YrsExper      991.64      88.44   11.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8892 on 206 degrees of freedom
## Multiple R-squared:  0.379,  Adjusted R-squared:  0.376 
## F-statistic: 125.7 on 1 and 206 DF,  p-value: < 2.2e-16
summary(quad.lm)
## 
## Call:
## lm(formula = Salary ~ YrsExper + I(YrsExper^2), data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41717  -6013  -1285   5617  22271 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36378.79    1565.92  23.232  < 2e-16 ***
## YrsExper       -199.45     251.95  -0.792    0.429    
## I(YrsExper^2)    38.49       7.68   5.012 1.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8413 on 205 degrees of freedom
## Multiple R-squared:  0.4468, Adjusted R-squared:  0.4414 
## F-statistic: 82.77 on 2 and 205 DF,  p-value: < 2.2e-16

The multiple \(R^2\) for the linear model is 0.379; the adjusted \(R^2\) is 0.376. The multiple \(R^2\) for the quadratic model is 0.4468; the adjusted \(R^2\) is 0.4414.

According to these results, both models have a fairly low multiple \(R^2\) value, but the quadratic model’s is noticeably higher, which tells us that the quadratic model fits the data better than the linear model. “Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points. Technically, least squares regression minimizes the sum of the squared residuals.” Website link
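
Adjusted \(R^2\) applies a penalty for extra predictors, \(\bar R^2 = 1-(1-R^2)\frac{n-1}{n-p-1}\) where p is the number of predictors, so it only rises when an added term genuinely helps. A check on simulated data (not the report’s models):

```r
# Recompute adjusted R^2 by hand and compare with summary()'s value.
set.seed(6)
x <- runif(208, 0, 30)
y <- 36000 - 200 * x + 40 * x^2 + rnorm(208, sd = 8000)
fit <- lm(y ~ x + I(x^2))
s   <- summary(fit)
n <- length(y); p <- 2                        # p = number of predictors
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
c(adj, s$adj.r.squared)                       # the two values agree
```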

7.2.1 Check on outliers using cooks plots


8 Conclusion

8.1 Answer your research question

8.2 Suggest ways to improve model or experiment